Introduction
This dataset is a record of 3.5 Million+ US Domestic Flights from 1990 to 2009. It has been taken from OpenFlights website which have a huge database of different travelling mediums across the globe.
Attributes Description
- Origin_airport: Three letter airport code of the origin airport
- Destination_airport: Three letter airport code of the destination airport
- Origin_city: Origin city name
- Destination_city: Destination city name
- Passengers: Number of passengers transported from origin to destination
- Seats: Number of seats available on flights from origin to destination
- Flights: Number of flights between origin and destination (multiple records for one month, many with flights > 1)
- Distance: Distance (to nearest mile) flown between origin and destination
- Fly_date: The date (yyyymm) of flight
- Origin_population: Origin city’s population as reported by US Census
- Destination_population: Destination city’s population as reported by US Census
For the purpose of visualization and data analysis I have geocoded all the airports with the help of batch geocoding. So You are likely to find locations dataset many times in this report. It contains coordinates for all the airports in dataset.
Loading data and Libraries
For this kernel I have used a bunch of libraries that I find useful for visualisation and data preprocessing. For loading the data I am using read_csv which loads the data as a tibble. read_csv is faster than traditional read.csv method and loads data in as numeric & character unlike read.csv which considers every attribute as factors.
library(readr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(plotly)
library(highcharter)
library(leaflet)
library(psych)
traindata <- read_tsv('chimps_16091-2010-08-03_17-08-31/flight_edges.tsv',
col_names = c('Origin_airport',
'Destination_airport',
'Origin_city',
'Destination_city',
'Passengers',
'Seats',
'Flights',
'Distance',
'Fly_date',
'Origin_population',
'Destination_population'))Preprocessing of data
This dataset is huge and acquires about 259 mb of RAM. Unfortunately, my system won’t able to proces this big data for visualisation so I have selected data of two recent years for this kernel.
This dataset contains date as yyyymm format. Converting this format to R’s Date is a little bit tricky. To convert this format I have first added dd at the end of attribute and then converted it using as.Date function. Moreover, I extracted State code from Origin and Destination cities to track different states of the airports.
#Converting date attribute to date format
traindata$Fly_date <- as.character(traindata$Fly_date)
traindata$Fly_date <- paste(traindata$Fly_date, '01',sep = '')
traindata$Fly_date <- as.Date(traindata$Fly_date, '%Y%m%d')
#choosing less time for better performance
traindata <- traindata[year(traindata$Fly_date) >= 2008,]
traindata$Origin_code <- sub('.*[, ]', '', traindata$Origin_city)
traindata$Destination_code <- sub('.*[, ]', '', traindata$Destination_city)Summary and box plots of data
Before diving into visualisation we need to take a look at the summary and skewness of the data. Here is a summary of all attributes in the data.
summary(traindata)## Origin_airport Destination_airport Origin_city
## Length:442963 Length:442963 Length:442963
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Destination_city Passengers Seats Flights
## Length:442963 Min. : 0 Min. : 0 Min. : 0.00
## Class :character 1st Qu.: 64 1st Qu.: 124 1st Qu.: 2.00
## Mode :character Median : 786 Median : 1200 Median : 20.00
## Mean : 2441 Mean : 3226 Mean : 33.84
## 3rd Qu.: 3043 3rd Qu.: 4110 3rd Qu.: 49.00
## Max. :89597 Max. :116785 Max. :960.00
## Distance Fly_date Origin_population
## Min. : 0.0 Min. :2008-01-01 Min. : 13005
## 1st Qu.: 285.0 1st Qu.:2008-06-01 1st Qu.: 1090408
## Median : 540.0 Median :2008-12-01 Median : 2158643
## Mean : 706.4 Mean :2008-12-05 Mean : 6092430
## 3rd Qu.: 926.0 3rd Qu.:2009-06-01 3rd Qu.: 8806874
## Max. :4983.0 Max. :2009-12-01 Max. :38139592
## Destination_population Origin_code Destination_code
## Min. : 13005 Length:442963 Length:442963
## 1st Qu.: 1090408 Class :character Class :character
## Median : 2158643 Mode :character Mode :character
## Mean : 6138862
## 3rd Qu.: 8806874
## Max. :38139592
Missing values can make a lot of difference to outcomes of analysation of data. I plotted a bar graph to see No. of missing values in the data and fortunately this dataset has no missing values.
#no. of missing values in the data
as.data.frame(table(is.na(traindata))) %>%
hchart('column',hcaes(Var1, Freq)) %>% hc_title(text = 'No. of Missing Values') %>%
hc_add_theme(hc_theme_google())Let’s take a look at our numeric attributes through boxplots. Boxplots shows outliers and five number summary which gives us a bird eye view of numeric attributes.
#Lets look at some box plots of the data
temp <- sample_n(traindata, 200000)
plot_ly(type = 'box') %>%
add_boxplot(y = ~temp$Passengers, name = 'Passengers') %>%
add_boxplot(
y = ~temp$Seats, name = 'Seats')plot_ly(type= 'box') %>%
add_boxplot(
y = ~temp$Flights, name = 'Flights')plot_ly(type = 'box') %>%
add_boxplot( y = ~temp$Distance, name = 'Distance' )plot_ly(type = 'box') %>%
add_boxplot(y = ~temp$Origin_population, name = 'Origin City Population') %>%
add_boxplot(y = ~temp$Destination_population, name = "Destination city Population")Other than Population attributes, all numeric attributes consists of lot of outliers which shows noise and anomalies in the data.
Visualisations
Population in different cities
Here we take a look at the population of different states using a choropleth map.
temp <- traindata[!duplicated(traindata[,'Origin_city']),]
temp <- aggregate(temp$Destination_population,
by = list(temp$Origin_code), sum)
temp$x <- temp$x/1000000
hcmap("countries/us/us-all", data = temp, value = "x",
joinBy = c("hc-a2", "Group.1"), name = "State",
dataLabels = list(enabled = TRUE, format = '{point.name}'),
borderColor = "#FAFAFA", borderWidth = 0.1,
tooltip = list(valueDecimals = 2, valueSuffix = ' Million')) %>%
hc_colorAxis(dataClasses = color_classes(c(0,5,seq(10,90,by = 20),100))) %>%
hc_legend(layout = 'vertical', align = 'right', floating = TRUE, valueDecimals = 0,
valueSuffix = 'M') %>%
hc_title(text = 'Population in different states')Through this choropleth map we have following observations:
- Florida and California are most populated states in USA
- Alaska, Rhode Island and New Hampshire are least populated states in USA
Does Population of the cities affect the airlines ?
This graph show the population of the city as Orange bar whereas blue bar represents Passengers boarded from this city in past 2 years.
temp <- traindata[!duplicated(traindata[,'Origin_city']),]
temp <- aggregate(temp$Origin_population,
by = list(temp$Origin_code), sum)
colnames(temp)[2] <- 'Population'
passenger_population <- aggregate(traindata$Passengers,
by = list(traindata$Origin_code),
sum)
passenger_population <- left_join(passenger_population, temp, by = "Group.1")
colnames(passenger_population)[2] <- 'Passengers'
passenger_population$Group.1 <- factor(passenger_population$Group.1, levels = passenger_population$Group.1[order(passenger_population$Population)])
plot_ly(passenger_population, x= ~Group.1, y= ~Passengers, type ='bar', name = 'Passengers') %>%
add_bars( y = ~Population, name = 'Population') %>%
layout(yaxis = list(title = 'Number of People', barmode = 'group'),
title = 'Origin_population vs Passengers boarded from Origin' )Through this graph we see that population does affect the number of passengers boarding the flight. There are some cities that doesn’t follow this trend for eg, Delaware, Indiana and West Virginia.
Best months for the airlines
Months also determine the profit of the airlines. Most people like to travel when whether is quite nice and sunny instead of cold winters. Let’s see what trend does this dataset shows in months.
#which months is most active month for flights
traindata$month <- months(traindata$Fly_date)
temp <- aggregate(traindata$Flights,
by = list(factor(traindata$month, levels= month.name),
factor(year(traindata$Fly_date))),
sum)
temp <- ts(temp$x, start = c(2008,1), end=c(2009,12), frequency = 12)
hchart(temp,
name = 'No. of flights') %>%
hc_add_theme(hc_theme_google()) %>% hc_title(text = 'Total flights in different months')#passengers in different months
temp <- aggregate(traindata$Passengers,
by = list(factor(traindata$month, levels= month.name),
year(traindata$Fly_date)),
sum)
Passengers <- ts(temp$x, start = c(2008,1), end=c(2009,12), frequency = 12)
hchart(Passengers) %>%
hc_add_theme(hc_theme_db()) %>% hc_title(text = 'Total Passengers in different months')These two different plots shows that people does like to travel in summers instead of winters. We see a deep dive in Passenger count and No. of flights in the months of January whereas airlines make most profit in July.
Is this google maps ?
It would be really nice if we can pinpoint the location of different airports mentioned in the dataset. Well here you are, I found out geocodes of all the airports in the USA and then selected only those which are present in this data. To present this information on this map I have used plotly library.
locations <- read_csv('Locations.csv')
locations$Longitude <- as.numeric(locations$Longitude)
locations$Latitude <- as.numeric(locations$Latitude)
temp <- data_frame(Address = unique(traindata$Destination_airport))
locations <- right_join(locations, temp, by = 'Address')
plot_geo() %>%
add_markers(x = ~locations$Longitude, y = ~locations$Latitude, size = locations$x,
alpha = 0.7) %>%
layout(geo = list(
scope = 'north america', showland = TRUE, landcolor = toRGB("gray95")
),title = 'Locations of the Airports') Hall Of Fame
This data contain about 19,000 different flights which are to much to be represented on map. So let’s countdown top 15 flights that were a hit among passengers. These flights carried most passengers in 2 years.
#Top 10 Flights on map
temp <- aggregate(traindata$Passengers,
by = list(traindata$Origin_airport,traindata$Destination_airport),
sum) %>% arrange(desc(x))
topflights <- head(temp, n = 30L)
index <- NA
for(i in 1:30){
for(j in i:30)
if(topflights$Group.1[i] ==topflights$Group.2[j] & topflights$Group.2[i] == topflights$Group.1[j]){
index <- c(index,i+1)
}
}
topflights <- topflights[-index[-1],]
topflights <- left_join(topflights,
read_csv('Locations.csv', col_names = c('Group.1',
'Origin_Latitude',
'Origin_Longitude')),
by = 'Group.1')
topflights <- left_join(topflights,
read_csv('Locations.csv', col_names = c('Group.2',
'Destination_Latitude',
'Destination_Longitude')),
by = 'Group.2')
plot_geo(locationmode = 'USA-states', sizes = c(100,250)) %>%
add_markers(x = ~topflights$Origin_Longitude,y = ~topflights$Origin_Latitude,
size = ~topflights$x,
text = ~paste(topflights$Group.1,'<br />',topflights$x/1e6,' million Passengers')) %>%
add_markers(x = ~topflights$Destination_Longitude, y =~topflights$Destination_Latitude,
size = ~topflights$x,
text = ~paste(topflights$Group.2,'<br />',topflights$x/1e6,' million Passengers')) %>%
add_segments(x = ~topflights$Origin_Longitude, xend= ~topflights$Destination_Longitude,
y = ~topflights$Origin_Latitude, yend = ~topflights$Destination_Latitude,
color = ~topflights$Group.1
) %>%
layout(geo = list(
scope = 'usa', showland = TRUE, landcolor = toRGB("gray95")
),title = 'Top 15 flights (In terms of Passengers)') The award of passenger’s choice goes to the flight from Los Angeles Airport to San Francisco International Airport. It carried around 2.9 million passengers in past 2 years. Followed by flight from Orlando Airport to Hartsfield-Jackson Atlanta International Airport with 2.7 million passengers.
Hall Of Shame
Not all flights were the best during these 2 years. Some of them suffered a lot. These are the flights that carried least passengers in past 2 years.
#LAST 10 flights on map
temp <- aggregate(traindata$Passengers,
by = list(traindata$Origin_airport,traindata$Destination_airport),
sum) %>% arrange(desc(x))
lastflights <- tail(temp, n = 15L)
lastflights <- left_join(lastflights,
read_csv('Locations.csv', col_names = c('Group.1',
'Origin_Latitude',
'Origin_Longitude')),
by = 'Group.1')
lastflights <- left_join(lastflights,
read_csv('Locations.csv', col_names = c('Group.2',
'Destination_Latitude',
'Destination_Longitude')),
by = 'Group.2')
plot_geo(sizes = c(1,30)) %>%
add_markers(x = ~lastflights$Origin_Longitude,y = ~lastflights$Origin_Latitude,
size = ~lastflights$x,
text = ~paste(lastflights$Group.1,'<br />',lastflights$x/1e6,' Passengers')) %>%
add_markers(x = ~lastflights$Destination_Longitude, y = ~lastflights$Destination_Latitude,
size = ~lastflights$x,
text = ~paste(lastflights$Group.1,'<br />',lastflights$x/1e6,' Passengers')) %>%
add_segments(x = ~lastflights$Origin_Longitude, xend= ~lastflights$Destination_Longitude,
y = ~lastflights$Origin_Latitude, yend = ~lastflights$Destination_Latitude,
color = ~lastflights$Group.1
) %>%
layout(geo = list(
scope = 'north america',
showland = TRUE, landcolor = toRGB("gray95"), showframe = FALSE
),title = 'Last 15 flights (In terms of Passengers)')temp <- aggregate(traindata$Distance, by = list(traindata$Origin_airport,
traindata$Destination_airport)
,max) %>% arrange(desc(x))All of these flights carried 0 passengers in past two years. Most of them were destined to Youngstown-Warren Regional Airport which makes it the least favourite airport in this dataset.
Passenger Vs Seats
Are there enough seats for this much passengers ?
#Passengers to seat ratio
temp <- aggregate(traindata$Passengers,
by=list(factor(traindata$month, levels = month.name),
year(traindata$Fly_date)), sum)
temp1 <- aggregate(traindata$Seats,
by= list(factor(traindata$month, levels = month.name),
year(traindata$Fly_date)), sum)
Passengers<- ts(temp$x, start = c(2008,1),end = c(2009,12), frequency = 12)
Seats <- ts(temp1$x, start=c(2008,1),end = c(2009,12), frequency = 12)
hchart(cbind(Passengers,Seats)) %>%
hc_add_theme(hc_theme_flat()) %>%
hc_title(text = 'Passengers vs Seats in Specific month ( Hover for more info)')Well it turned out that there is huge difference between the Number of Passengers and Available seats in the flights. Most of them even travel empty.
In the peak month of July 2008 there were about 13 million empty seats in the flights.
In the January 2008 the number of empty seats were 20 million+.
Not much profitable flights
These flights carried more luggage than the passengers themselves. These flights have highest empty seat count in this dataset.
#most empty flights in USA
traindata$left_seats <- traindata$Seats - traindata$Passengers
temp <- aggregate(traindata$left_seats,
by= list(traindata$Origin_airport,
traindata$Destination_airport),max) %>% arrange(desc(x))
colnames(temp) <- c('Origin_airport','Destination_airport','left_seats')
temp <- left_join(temp, traindata, by = c('Origin_airport','Destination_airport','left_seats'))
temp$left_seats <- temp$left_seats/temp$Flights
index <- NA
topemptyflights <- head(temp, n = 30L)
for(i in 1:30){
for(j in i:30)
if(topemptyflights$Origin_airport[i] == topemptyflights$Destination_airport[j] & topemptyflights$Origin_airport[j] == topemptyflights$Destination_airport[i]){
index <- c(index,i+1)
}
}
topemptyflights <- topemptyflights[-index[-1],]
topemptyflights <- left_join(topemptyflights,
read_csv('Locations.csv', col_names = c('Origin_airport',
'Origin_Latitude',
'Origin_Longitude')),
by = 'Origin_airport')
topemptyflights <- left_join(topemptyflights,
read_csv('Locations.csv', col_names = c('Destination_airport',
'Destination_Latitude',
'Destination_Longitude')),
by = 'Destination_airport')
plot_geo(sizes = c(1,250)) %>%
add_markers(x = ~topemptyflights$Origin_Longitude,y = ~topemptyflights$Origin_Latitude,
size = ~topemptyflights$left_seats,
text = ~paste(
topemptyflights$Origin_airport,
'<br />',
format(round(topemptyflights$left_seats,2), nsmall = 2),
' empty seats per flight','<br />',
topemptyflights$Flights, ' flights')) %>%
add_markers(x = ~topemptyflights$Destination_Longitude, y = ~topemptyflights$Destination_Latitude,
size = ~topemptyflights$left_seats,
text = ~paste(
topemptyflights$Destination_airport,
'<br />',
format(round(topemptyflights$left_seats,2), nsmall = 2),
' empty seats per flight', '<br />',
topemptyflights$Flights, ' flights')) %>%
add_segments(x = ~topemptyflights$Origin_Longitude, xend= ~topemptyflights$Destination_Longitude,
y = ~topemptyflights$Origin_Latitude, yend = ~topemptyflights$Destination_Latitude,
color = ~topemptyflights$Origin_airport
) %>%
layout(geo = list(
scope = 'north america',
showland = TRUE, landcolor = toRGB("gray95"),showframe = FALSE
),title = 'Top empty flights (Hover for more Info)'
)Longest Flights
These flights travels the largest distances among all others. Most of them are destined to Honolulu Internation Airport which is situated in Hawaii.
temp <- aggregate(traindata$Distance, by = list(traindata$Origin_airport,
traindata$Destination_airport)
,max) %>% arrange(desc(x))
longflights <- head(temp, 15L)
longflights <- left_join(longflights,
read_csv('Locations.csv', col_names = c('Group.1',
'Origin_Latitude',
'Origin_Longitude')),
by = 'Group.1')
longflights <- left_join(longflights,
read_csv('Locations.csv', col_names = c('Group.2',
'Destination_Latitude',
'Destination_Longitude')),
by = 'Group.2')
plot_geo(sizes = c(1,250)) %>%
add_markers(x = ~longflights$Origin_Longitude,y = ~longflights$Origin_Latitude,
size = ~longflights$x,
text = ~paste(longflights$Group.1,'<br />',longflights$x,' miles')) %>%
add_markers(x = ~longflights$Destination_Longitude, y = ~longflights$Destination_Latitude,
size = ~longflights$x,
text = ~paste(longflights$Group.1,'<br />',longflights$x,' miles')) %>%
add_segments(x = ~longflights$Origin_Longitude, xend= ~longflights$Destination_Longitude,
y = ~longflights$Origin_Latitude, yend = ~longflights$Destination_Latitude,
color = ~longflights$Group.1
) %>%
layout(geo = list(
scope = 'north america',
showland = TRUE, landcolor = toRGB("gray95"),showframe = FALSE
),title = 'Longest Flights (In terms of Distance)'
)Does Distance affect Passengers ?
temp <- aggregate(traindata$Passengers,
by = list(traindata$Distance,
traindata$Origin_airport,
traindata$Destination_airport),
sum) %>% arrange(desc(Group.1))
colnames(temp) <- c("Distance",'Origin_airport', 'Destination_airport',
'Passengers')
temp <- sample_n(temp,10000)
plot_ly(temp, x =~Distance, y =~Passengers, type = 'scatter', size = ~Distance,color = ~Distance,
text = ~paste('Distance :',temp$Distance,'<br />',
'Passengers :',temp$Passengers)) %>% hide_colorbar()Clearly Distance affects Passengers boarding flights.
Dependence of numeric attributes
At the end of this kernel I would like to depict the correlation between different numeric attributes in this dataset.
In this plot right triangle show correlation between the attributes.
left triangle shows relation through plots.
Histograms shows the skewness of the data.
pairs.panels(sample_n(traindata,10000)[,c(5:8,10,11)])Most information like dependency of the variables on one another are self explanatory through this graph. This graph summarises information of above plots.
Summary
This dataset is an outstanding collection of different attributes out of which some are dependent and some are independent. It mainly consits of two types of attributes :
- Numeric
- Character
Here are some main observations from visualisations:
Data is mostly skewed in numeric attributes and consists of outliers but doesn’t have any missing values. One might need to clean it before predictions.
Most Preferred flight is routed from Los Angeles Airport to San Fransico Airport and least Preferred flight mostly routed to Youngstown-Warren Regional Airport.
Passengers is main attribute that is driving this whole dataset.
Passengers are positively affected by Population of state and negatively affected by Distance travelled by the flight.
For prediction of Passengers one can use Adaptive boosting, Linear Regression because passengers is a numeric attribute and is dependent on mostly numeric attributes.
New features like tourist attractions, weather conditions can also be added for more detailed predictions and data analysation.